Section B

Handling Land Area Data

Handling outliers

County_Complete dataset

Dropping features with more than half of their values null
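A minimal sketch of this step, assuming the data is in a pandas DataFrame. The helper name `drop_mostly_null_columns`, the toy values, and the column `mostly_missing` are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd


def drop_mostly_null_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without columns whose null fraction exceeds threshold."""
    null_fraction = df.isna().mean()  # fraction of nulls per column
    keep = null_fraction[null_fraction <= threshold].index
    return df[keep].copy()


# Toy frame with made-up values (hypothetical, for illustration only)
example = pd.DataFrame({
    "fips": [1, 2, 3, 4],
    "pop2019": [10.0, 20.0, np.nan, 40.0],          # 25% null, kept
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% null, dropped
})
cleaned = drop_mostly_null_columns(example)
print(list(cleaned.columns))
```

The same helper could be reused for the Life Expectancy data, which gets the same treatment.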

We will drop features that are highly correlated
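One common way to do this is to scan the upper triangle of the absolute correlation matrix and drop the later column of each highly correlated pair. The sketch below assumes a cutoff of 0.9, which is our assumption and not a value stated in the notebook. The helper name and toy columns are also hypothetical.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: "b" is a linear function of "a", "c" is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced, dropped = drop_highly_correlated(toy)
print(dropped)
```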

Life Expectancy data

Dropping features with more than half of their values null

Merging datasets

Imputing null values
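The notes do not say which imputation strategy was used. As one possible approach, the sketch below fills numeric nulls with the column median using scikit-learn's `SimpleImputer`. The median strategy and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with gaps (hypothetical values)
merged = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Median imputation is an assumed choice, not necessarily the notebook's
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(merged),
    columns=merged.columns,
    index=merged.index,
)
print(imputed)
```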

PCA + TSNE
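A minimal sketch of producing both 2D projections with scikit-learn. The random matrix is a synthetic stand-in for the merged feature table, and the scaling step and the `perplexity` value are assumptions, not settings taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged, imputed feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))

# Standardize first so no single feature dominates by scale
X_scaled = StandardScaler().fit_transform(X)

# Linear projection onto the two directions of highest variance
pca_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Non-linear embedding; perplexity must be smaller than the number of samples
tsne_2d = TSNE(
    n_components=2, perplexity=15, init="pca", random_state=0
).fit_transform(X_scaled)

print(pca_2d.shape, tsne_2d.shape)
```

Each array can then be scattered with points colored by state to compare how well the two methods separate them.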

Wyoming and South Dakota are similar to one another

We know the features are not separated properly. We tried many feature combinations (most of them chosen by variance), but we could not reach an optimal result

We tried removing each feature in turn and plotting both graphs to see which features were the most and least effective. According to the graphs (mainly TSNE), removing 'fips' made the separation worse, and removing 'pop2019' and 'sales_per_capita' gave us results similar to the graphs we had before

Section C

We will first build the label

Some population values are missing, so we will estimate them with this formula

pop_estimate.png

From now on District of Columbia will be dropped

Now we will calculate the population over the age of 18, which is the number of people who are eligible to vote

The formula used before to estimate population is only valid for years between two censuses. Since we don't have information about age18 before 2010, another formula is going to be used to estimate this population

grr.png

gr_3.png

Voter turnout estimates have been created
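A sketch of how such a turnout estimate can be computed, assuming turnout is the total votes divided by the estimated population over 18. The table layout and the column names `total_votes`, `pop_over_18`, and `turnout_pct` are hypothetical, and the numbers are made up.

```python
import pandas as pd

# Hypothetical vote totals per state and year
votes = pd.DataFrame({
    "state": ["A", "B"],
    "year": [2016, 2016],
    "total_votes": [600, 450],
})

# Hypothetical estimates of the population over 18 (eligible voters)
eligible = pd.DataFrame({
    "state": ["A", "B"],
    "year": [2016, 2016],
    "pop_over_18": [1000, 900],
})

# Align the two tables on state and year, then take the ratio
turnout = votes.merge(eligible, on=["state", "year"], how="inner")
turnout["turnout_pct"] = 100 * turnout["total_votes"] / turnout["pop_over_18"]
print(turnout)
```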

Minnesota scores highest in voter turnout percentages

Creating Features

After doing some research on what may affect voter turnout percentages, we found that some demographic factors have the most effect:

Gender will not be explored further either, due to lack of data

We will be looking at ethnicity features. We spotted 6 ethnicities in the data:

As shown in the plot, the most dominant ethnicities across all states are White, Black and Hispanic, so we will be considering them as features

We are going to use the 2010 and 2019 data to extract the features

We are going to calculate the number of people in each ethnic group for each county, and then sum over the counties so we can get the percentage of each ethnicity per state
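The aggregation described above can be sketched as follows. We assume the county table stores each group as a percentage of the county population. The column names (`pop`, `white_pct`, and so on) and the values are hypothetical.

```python
import pandas as pd

# Hypothetical county-level table: population and group shares in percent
counties = pd.DataFrame({
    "state": ["A", "A", "B"],
    "pop": [100, 300, 200],
    "white_pct": [50.0, 70.0, 40.0],
    "black_pct": [30.0, 10.0, 20.0],
    "hispanic_pct": [20.0, 20.0, 40.0],
})
groups = ["white", "black", "hispanic"]

# Step 1: convert county percentages into head counts
for g in groups:
    counties[f"{g}_count"] = counties["pop"] * counties[f"{g}_pct"] / 100

# Step 2: sum population and head counts over the counties of each state
count_cols = [f"{g}_count" for g in groups]
state_totals = counties.groupby("state")[["pop"] + count_cols].sum()

# Step 3: turn the state-level counts back into percentages
for g in groups:
    state_totals[f"{g}_pct"] = 100 * state_totals[f"{g}_count"] / state_totals["pop"]

print(state_totals[[f"{g}_pct" for g in groups]])
```

Summing counts rather than averaging county percentages weights each county by its population, so large counties are not under-represented.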

Calculating the population of each ethnic group (above 18) in the missing years, using the same function used above for estimating the above-18 population

We are going to assume that these percentages are the same for people above the age of 18.

We are now going to calculate population above the age of 18, for each ethnic group.

Now we are going to explore socio-economic features, such as education, employment and income
There are many features, and we saw in an earlier section that features that are the same but come from different years are highly correlated, so we will look at how the other, different features affect each other

From the heatmap above and from checking the correlations between features, we found that some features have the same effect and are highly correlated, so we will be dropping them.

We are now going to calculate these features for people above the age of 18

Joining all features together for prediction part

Prediction

Gradient Boosting Regressor

Random Forest Regressor

KNN

No feature importance for KNN
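A sketch of fitting the three regressors above and comparing them by MSE, using synthetic data in place of the turnout features. The hyperparameters and the train/test split are assumptions. It also shows why KNN has no feature importance plot: `KNeighborsRegressor` does not expose a `feature_importances_` attribute, while the two tree ensembles do.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the feature table and turnout label
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

mse = {}
has_importances = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    # Tree ensembles expose feature_importances_, KNN does not
    has_importances[name] = hasattr(model, "feature_importances_")

print(mse)
print(has_importances)
```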

Predictions for California, Florida, South Dakota, Wyoming

In order to get some of these states into the top 25, we dropped these features: 'per_capita_income', 'bachelors', 'hs_grad', and we added 'year', but the overall MSE was worse

Section D

'runoff' has no variance and also contains many null values, so it will be dropped

Since we are asked to predict Democrats / Republicans, we will drop other parties
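A minimal sketch of this filter, assuming the results table has a `party` column. The column names, the party spellings, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical election results table
results = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "party": ["DEMOCRAT", "REPUBLICAN", "LIBERTARIAN", "DEMOCRAT", "GREEN"],
    "candidatevotes": [500, 450, 50, 300, 20],
})

# Keep only the two parties we are asked to predict
major_parties = ["DEMOCRAT", "REPUBLICAN"]
major = results[results["party"].isin(major_parties)].reset_index(drop=True)
print(major)
```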

The House data features have no variance, which means they will not contribute much to the prediction, so they will be dropped
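Columns without variance can be found by counting distinct non-null values. The sketch below treats a column with at most one distinct non-null value as constant. The helper name and the toy columns are hypothetical.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame):
    """Drop columns that hold at most one distinct non-null value."""
    constant = [col for col in df.columns if df[col].nunique() <= 1]
    return df.drop(columns=constant), constant


# Toy table: "runoff" and "special" never vary, "candidatevotes" does
house = pd.DataFrame({
    "runoff": [False, False, None, False],
    "special": [False, False, False, False],
    "candidatevotes": [100, 250, 75, 300],
})
house_reduced, removed = drop_constant_columns(house)
print(removed)
```

Because `nunique()` ignores nulls by default, a column that is partly null and otherwise constant is also removed.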

We will be using the same features used in Section C

But since most of the data is from the 2010s, we will not consider years before 2000, because estimating missing data that far back may give highly inaccurate results

Joining features together

Prediction

Accuracy is the metric used for all classifier algorithms

Algorithms used:

SVC

AdaBoost

Random Forest Classifier

KNN Classifier
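A sketch of scoring the four classifiers listed above by accuracy, with synthetic binary data standing in for the Democrat / Republican label. The scaling step, the hyperparameters, and the split are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and the two-party label
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

classifiers = {
    # SVC and KNN are distance based, so they get a scaling step
    "svc": make_pipeline(StandardScaler(), SVC()),
    "adaboost": AdaBoostClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, clf.predict(X_test))

print(accuracy)
```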

Section E

We have nulls in columns that might end up with incorrect data if we tried to impute them, so these columns will be dropped

The House data features have no variance, which means they will not contribute much to the prediction, so they will be dropped

Merging with labels, and extracting test set

Since 2022 does not have any of the extra features that came from the senate table, those features will be dropped

SVC

Random Forest Classifier

KNN Classifier